Loan Default Prediction¶

Problem Definition¶

A major proportion of retail bank profit comes from interest on home loans, which are typically borrowed by regular-income or high-earning customers. Banks are most fearful of defaulters, as bad loans (NPAs) usually eat up a major chunk of their profits. It is therefore important for banks to be judicious while approving loans for their customer base. The approval process is multifaceted: the bank checks the creditworthiness of the applicant through a manual study of various aspects of the application. The entire process is not only effort-intensive but also prone to wrong judgments owing to human error and bias. Many banks have attempted to automate this process using heuristics, but with the advent of data science and machine learning, the focus has shifted to building models that can learn the approval process, making it more efficient and free of bias. At the same time, it is important to ensure that the machine does not learn the biases that previously crept in through the human approval process.

The Context:¶

A bank's consumer credit department aims to simplify the decision-making process for approving home equity lines of credit. To do this, it will follow the Equal Credit Opportunity Act's guidelines to establish an empirically derived and statistically sound credit scoring model. The model will be based on data obtained through the existing loan underwriting process from recent applicants who have been granted credit. It will be built with predictive modeling techniques, but it must be interpretable enough to provide a justification for any adverse decision (rejection).

The objective:¶

Build a classification model to predict clients who are likely to default on their loan and give recommendations to the bank on the important features to consider while approving a loan.

The key questions:¶

  • What are the borrower characteristics that lead to default?

The problem formulation:¶

  • Predict the likelihood of a loan going bad, and identify the key characteristics or profile of such loans.

Data Description:¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.

  • BAD: 1 = Client defaulted on loan, 0 = loan repaid

  • LOAN: Amount of loan approved.

  • MORTDUE: Amount due on the existing mortgage.

  • VALUE: Current value of the property.

  • REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).

  • JOB: The type of job the loan applicant has, such as manager, self-employed, etc.

  • YOJ: Years at present job.

  • DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).

  • DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

  • CLAGE: Age of the oldest credit line in months.

  • NINQ: Number of recent credit inquiries.

  • CLNO: Number of existing credit lines.

  • DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).

Import the necessary libraries and Data¶

In [71]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score,precision_score,recall_score,f1_score,roc_curve, auc

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier,plot_tree

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

import scipy.stats as stats

from sklearn.model_selection import GridSearchCV


import warnings
warnings.filterwarnings('ignore')

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values

Reading the dataset

In [72]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [73]:
hm=pd.read_csv("/content/drive/My Drive/GL_MIT_ADSP/Week9_Capstone/data/hmeq.csv")
In [74]:
data = hm.copy()

Data checks

In [75]:
hm.head()
Out[75]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN
In [76]:
# check the number of rows and columns
hm.shape
Out[76]:
(5960, 13)
In [77]:
# check the data types and missings
hm.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
In [78]:
hm.isna().sum()/hm.shape[0]*100
Out[78]:
0
BAD 0.000000
LOAN 0.000000
MORTDUE 8.691275
VALUE 1.879195
REASON 4.228188
JOB 4.681208
YOJ 8.640940
DEROG 11.879195
DELINQ 9.731544
CLAGE 5.167785
NINQ 8.557047
CLNO 3.724832
DEBTINC 21.258389

In [79]:
hm.duplicated().sum()
Out[79]:
np.int64(0)
In [80]:
hm.nunique()
Out[80]:
0
BAD 2
LOAN 540
MORTDUE 5053
VALUE 5381
REASON 2
JOB 6
YOJ 99
DEROG 11
DELINQ 14
CLAGE 5314
NINQ 16
CLNO 62
DEBTINC 4693

In [81]:
# creating list of numeric and categorical variables
num_cols = hm.select_dtypes(include=np.number).columns.tolist()
cat_cols = hm.select_dtypes(include='object').columns.tolist()
# moving 'BAD' from the numeric list to the categorical list (binary target)
num_cols.remove('BAD')
cat_cols.append('BAD')
print(num_cols)
print(cat_cols)
['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
['REASON', 'JOB', 'BAD']

Insights

  • Dataset has 5,960 rows and 13 columns.
  • 11 variables are numeric and 2 are non-numeric.
  • There are no duplicates in the dataset.
  • Except for 'BAD' and 'LOAN', all variables have missing values: 'DEBTINC' has ~21% and 'DEROG' has ~12% missing; the remaining variables have less than 10% missing.

Summary Statistics¶

In [82]:
hm.describe(include='all').T
Out[82]:
count unique top freq mean std min 25% 50% 75% max
BAD 5960.0 NaN NaN NaN 0.199497 0.399656 0.0 0.0 0.0 0.0 1.0
LOAN 5960.0 NaN NaN NaN 18607.969799 11207.480417 1100.0 11100.0 16300.0 23300.0 89900.0
MORTDUE 5442.0 NaN NaN NaN 73760.8172 44457.609458 2063.0 46276.0 65019.0 91488.0 399550.0
VALUE 5848.0 NaN NaN NaN 101776.048741 57385.775334 8000.0 66075.5 89235.5 119824.25 855909.0
REASON 5708 2 DebtCon 3928 NaN NaN NaN NaN NaN NaN NaN
JOB 5681 6 Other 2388 NaN NaN NaN NaN NaN NaN NaN
YOJ 5445.0 NaN NaN NaN 8.922268 7.573982 0.0 3.0 7.0 13.0 41.0
DEROG 5252.0 NaN NaN NaN 0.25457 0.846047 0.0 0.0 0.0 0.0 10.0
DELINQ 5380.0 NaN NaN NaN 0.449442 1.127266 0.0 0.0 0.0 0.0 15.0
CLAGE 5652.0 NaN NaN NaN 179.766275 85.810092 0.0 115.116702 173.466667 231.562278 1168.233561
NINQ 5450.0 NaN NaN NaN 1.186055 1.728675 0.0 0.0 1.0 2.0 17.0
CLNO 5738.0 NaN NaN NaN 21.296096 10.138933 0.0 15.0 20.0 26.0 71.0
DEBTINC 4693.0 NaN NaN NaN 33.779915 8.601746 0.524499 29.140031 34.818262 39.003141 203.312149
In [83]:
for col in cat_cols:
  print(col)
  print(hm[col].value_counts())
  print(hm[col].value_counts(normalize=True)),
  print('-----'*10)
REASON
REASON
DebtCon    3928
HomeImp    1780
Name: count, dtype: int64
REASON
DebtCon    0.688157
HomeImp    0.311843
Name: proportion, dtype: float64
--------------------------------------------------
JOB
JOB
Other      2388
ProfExe    1276
Office      948
Mgr         767
Self        193
Sales       109
Name: count, dtype: int64
JOB
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: proportion, dtype: float64
--------------------------------------------------
BAD
BAD
0    4771
1    1189
Name: count, dtype: int64
BAD
0    0.800503
1    0.199497
Name: proportion, dtype: float64
--------------------------------------------------

Observations from Summary Statistics

  • Average loan amount approved is ~18,600, with a minimum of 1,100 and a maximum of 89,900. This column has no missing values.
  • Average amount due on the existing mortgage is ~73,760, with a minimum of ~2,000 and a maximum of 399,550.
  • Average value of the current property is ~101,776.
  • Fewer than 25% of loans have 1 or more major derogatory reports or delinquent credit lines (the 75th percentiles of DEROG and DELINQ are both 0).
  • Average number of recent inquiries is ~1.2; average age of the oldest credit line is ~180 months.
  • Average debt-to-income ratio is ~34%, and ~75% of loans have DTI below ~39%.
  • Average borrower job tenure is ~9 years; only 25% of borrowers have more than 13 years at their present job.
  • ~69% of loans were taken for debt consolidation.
  • 42% of applicants do not have a defined job type ('Other'). Less than 2% are in sales jobs and ~3.4% are self-employed.

All columns except 'BAD' and 'LOAN' have missing values.

Exploratory Data Analysis (EDA) and Visualization¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?
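Several of these questions reduce to one-line pandas expressions. A minimal sketch on a toy stand-in frame (column names follow the data description above; the values are illustrative, not the real HMEQ data):

```python
import pandas as pd

# Toy stand-in for the HMEQ frame; values are illustrative only.
toy = pd.DataFrame({
    "LOAN":   [1100, 16300, 89900],
    "YOJ":    [3.0, 7.0, 13.0],
    "REASON": ["DebtCon", "HomeImp", "DebtCon"],
    "JOB":    ["Other", "Office", "Other"],
})

loan_range = (toy["LOAN"].min(), toy["LOAN"].max())  # Q1: range of LOAN
yoj_summary = toy["YOJ"].describe()                  # Q2: distribution of YOJ
n_reasons = toy["REASON"].nunique()                  # Q3: unique REASON categories
top_job = toy["JOB"].mode()[0]                       # Q4: most common JOB

print(loan_range, n_reasons, top_job)
```

The same expressions applied to `hm` answer questions 1-4 directly.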

Univariate Analysis¶

In [84]:
# plot histogram for numeric variable to understand the distribution
hm[num_cols].hist(figsize=(20,20),layout=(4,5)  ,bins=20)
plt.show()
In [85]:
# Box plots to identify the outliers
hm[num_cols].plot(kind='box',subplots=True,layout=(4,4),figsize=(25,25))
Out[85]:
0
LOAN Axes(0.125,0.712609;0.168478x0.167391)
MORTDUE Axes(0.327174,0.712609;0.168478x0.167391)
VALUE Axes(0.529348,0.712609;0.168478x0.167391)
YOJ Axes(0.731522,0.712609;0.168478x0.167391)
DEROG Axes(0.125,0.511739;0.168478x0.167391)
DELINQ Axes(0.327174,0.511739;0.168478x0.167391)
CLAGE Axes(0.529348,0.511739;0.168478x0.167391)
NINQ Axes(0.731522,0.511739;0.168478x0.167391)
CLNO Axes(0.125,0.31087;0.168478x0.167391)
DEBTINC Axes(0.327174,0.31087;0.168478x0.167391)


All numeric variables except the age of the oldest credit line (CLAGE) and the number of credit lines (CLNO) have significantly right-skewed distributions, and all numeric variables show the presence of outliers.
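The visual impression from the histograms can be confirmed numerically with `Series.skew()`, where positive values indicate right skew. A self-contained sketch on synthetic data (an exponential series standing in for a skewed variable like DEBTINC, a normal series standing in for a roughly symmetric one like CLAGE):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: a right-skewed variable and a roughly symmetric one.
skewed = pd.Series(rng.exponential(scale=10, size=2000))
symmetric = pd.Series(rng.normal(loc=180, scale=85, size=2000))

# Positive skew => long right tail; near zero => roughly symmetric.
print(skewed.skew(), symmetric.skew())
```

Running `hm[num_cols].skew()` on the actual data gives the same ranking as the histograms.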

Bivariate Analysis¶

In [86]:
hm.groupby(['BAD'])[num_cols].mean()
Out[86]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
BAD
0 19028.107315 74829.249055 102595.921018 9.154941 0.134217 0.245133 187.002355 1.032749 21.317036 33.253129
1 16922.119428 69460.452973 98172.846227 8.027802 0.707804 1.229185 150.190183 1.782765 21.211268 39.387645

On average, loans that default:

  • have a lower existing mortgage amount due than non-defaulters
  • have slightly fewer years at the present job (~8 yrs vs. ~9 yrs for non-defaulters)
  • have higher average numbers of derogatory reports and delinquent credit lines than non-defaulters
  • have a lower average age of oldest credit line and a higher average number of inquiries
  • have a higher average debt-to-income ratio
  • have a similar number of existing tradelines to non-defaulters
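The group-mean comparison above can be extended to default rates within bands of a variable using `pd.cut`. A sketch on a toy frame (the column names match the dataset, but the values and the 40-point DTI cut are illustrative assumptions, not results from the notebook):

```python
import pandas as pd

# Toy frame: default flag and debt-to-income ratio (illustrative values).
toy = pd.DataFrame({
    "BAD":     [0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
    "DEBTINC": [20, 25, 28, 30, 45, 33, 55, 60, 48, 35],
})

# Bin DTI and compute the default rate per band.
toy["DTI_band"] = pd.cut(toy["DEBTINC"], bins=[0, 40, 100], labels=["<=40", ">40"])
rate = toy.groupby("DTI_band", observed=True)["BAD"].mean()
print(rate)
```

On the real data this pattern shows how sharply the bad rate rises with DTI.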
In [87]:
# Categorical variables vs. 'Bad'
for i in cat_cols[0:2]:
     print( pd.crosstab(hm[i], hm['BAD'], normalize = 'index')*100)
     mv =  (pd.crosstab(hm[i], hm['BAD'], normalize = 'index')*100).plot(kind = 'bar', figsize = (8, 4), stacked = True)
     plt.ylabel('% Bad  ')
     for p in mv.patches:
       plt.annotate(f'{int(p.get_height())}',
               (p.get_x() + p.get_width() / 2., p.get_y()),
               ha='center', va='center', xytext=(0, 5), textcoords='offset points')
     plt.show()
BAD              0          1
REASON                       
DebtCon  81.033605  18.966395
HomeImp  77.752809  22.247191
BAD              0          1
JOB                          
Mgr      76.662321  23.337679
Office   86.814346  13.185654
Other    76.800670  23.199330
ProfExe  83.385580  16.614420
Sales    65.137615  34.862385
Self     69.948187  30.051813
  • Loans for 'Home Improvement' have a marginally higher bad rate than those for 'Debt Consolidation'.
  • Job types 'Office' and 'ProfExe' have lower bad rates than the remaining job types; 'Sales' and 'Self' have the highest bad rates.
In [88]:
plt.figure(figsize = [8, 6])

sns.scatterplot(data = hm ,x = hm.MORTDUE	, y = hm.VALUE,hue='BAD')
plt.title('Amount due on existing Mortgage vs Property Value ')
plt.show()

There is no clear separation in the loan profile by amount due on the existing mortgage and property value; defaults occur across the range of both variables.

In [89]:
plt.figure(figsize = [8, 6])

sns.scatterplot(data = hm ,x = hm.YOJ	, y = hm.CLAGE,hue='BAD')
plt.title('Years at Job vs Age of oldest Credit line ')
plt.show()

There is a higher concentration of defaults when years at the present job are fewer than 10 and the age of the oldest tradeline is below 200 months.

In [90]:
plt.figure(figsize = [8, 6])
sns.scatterplot(data = hm ,x = hm.CLNO	, y = hm.NINQ,hue='BAD')
plt.title('# of Existing TL vs # of Inquiries ')
plt.show()

Defaults are observed when there are 4 or more inquiries.

There is no definite pattern by the number of existing credit lines.

The combination of 5+ inquiries and 20+ existing tradelines shows higher defaults.

In [91]:
plt.figure(figsize = [8, 6])

sns.scatterplot(data = hm ,x = hm.DEBTINC	, y = hm.MORTDUE,hue='BAD')
plt.title('# Debt to Income vs existing mortgage due ')
plt.show()

There is a higher concentration of defaults when the debt-to-income ratio is greater than 50%.

Multivariate Analysis¶

In [92]:
plt.figure(figsize=(10,10))
sns.heatmap(hm[num_cols].corr(), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral")
plt.title("Correlation Heatmap")
plt.show()

Amount due on the existing mortgage and property value are strongly correlated (r ≈ 0.88). Both have a correlation of ~0.3 with the age of the oldest credit line.
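Strongly correlated pairs can also be extracted programmatically from the correlation matrix rather than read off the heatmap. A self-contained sketch on synthetic data (the column names mirror MORTDUE/VALUE/CLAGE, but the values and the 0.8 threshold are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy frame with one strongly correlated pair (MORTDUE vs VALUE stand-ins).
value = rng.normal(100_000, 50_000, size=500)
mortdue = 0.7 * value + rng.normal(0, 15_000, size=500)
clage = rng.normal(180, 85, size=500)
toy = pd.DataFrame({"MORTDUE": mortdue, "VALUE": value, "CLAGE": clage})

corr = toy.corr()
# Keep the upper triangle only, then list pairs with |r| above a threshold.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

On the real data, the same snippet surfaces the MORTDUE/VALUE pair flagged above.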

Treating Outliers¶

In [93]:
#Outlier treatment (capping) for DELINQ, DEROG and YOJ
hm['DELINQ_flag']=hm['DELINQ'].apply(lambda x: 6 if x >= 6 else x)
hm['DEROG_flag']=hm['DEROG'].apply(lambda x: 7 if x >= 7 else x)
hm['YOJ_calc']=hm['YOJ'].apply(lambda x: 30 if x >= 30 else x)
new_col = hm.columns.tolist()
hm[new_col[13:16]].hist()
Out[93]:
array([[<Axes: title={'center': 'DELINQ_flag'}>,
        <Axes: title={'center': 'DEROG_flag'}>],
       [<Axes: title={'center': 'YOJ_calc'}>, <Axes: >]], dtype=object)
In [94]:
#Outlier treatment (capping) for DEBTINC
hm['DEBTINC_calc']=hm['DEBTINC'].apply(lambda x:60 if x >=60 else x)
hm['DEBTINC'].hist()
plt.show()
hm['DEBTINC_calc'].hist()
plt.show()

After capping outliers, the right skewness of the debt-to-income distribution is reduced.
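The same capping can be written more idiomatically with `Series.clip`, which is equivalent to the `apply`/`lambda` pattern used above (values at or above the cap map to the cap; NaNs are preserved either way). A minimal sketch with illustrative values:

```python
import pandas as pd

debtinc = pd.Series([10.0, 35.0, 45.0, 120.0, 203.0])

# Equivalent to: debtinc.apply(lambda x: 60 if x >= 60 else x)
capped = debtinc.clip(upper=60)

# Capping pulls in the right tail, so sample skewness drops.
print(capped.max(), debtinc.skew(), capped.skew())
```

`clip` is vectorized, so it is also faster than `apply` on large frames.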

Treating Missing Values¶

In [95]:
num_cols_1 = hm.select_dtypes(include=np.number).columns.tolist()
print(num_cols_1)
cat_cols_1 = hm.select_dtypes(include='object').columns.tolist()
print(cat_cols_1)
['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'DELINQ_flag', 'DEROG_flag', 'YOJ_calc', 'DEBTINC_calc']
['REASON', 'JOB']
In [96]:
# Impute missing values: mean for numeric variables, mode for categorical variables
for i in num_cols_1:
  hm[i] = hm[i].fillna(hm[i].mean())

for i in cat_cols_1:
  hm[i] = hm[i].fillna(hm[i].mode()[0])
In [97]:
for col in cat_cols:
  print(col)
  print("***Checking distribution prior-imputation****")
  print(data[col].value_counts(normalize=True)),
  print("***Checking distribution post-imputation****")
  print(hm[col].value_counts(normalize=True)),
  print('-----'*10)
  #checking default rate before and after imputation
  print('****BEFORE imputation ****')
  print(pd.crosstab(data[col], hm['BAD'], normalize = 'index')*100)
  print('****AFTER imputation ****')
  print(pd.crosstab(hm[col], hm['BAD'], normalize = 'index')*100)
REASON
***Checking distribution prior-imputation****
REASON
DebtCon    0.688157
HomeImp    0.311843
Name: proportion, dtype: float64
***Checking distribution post-imputation****
REASON
DebtCon    0.701342
HomeImp    0.298658
Name: proportion, dtype: float64
--------------------------------------------------
****BEFORE imputation ****
BAD              0          1
REASON                       
DebtCon  81.033605  18.966395
HomeImp  77.752809  22.247191
****AFTER imputation ****
BAD              0          1
REASON                       
DebtCon  81.028708  18.971292
HomeImp  77.752809  22.247191
JOB
***Checking distribution prior-imputation****
JOB
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: proportion, dtype: float64
***Checking distribution post-imputation****
JOB
Other      0.447483
ProfExe    0.214094
Office     0.159060
Mgr        0.128691
Self       0.032383
Sales      0.018289
Name: proportion, dtype: float64
--------------------------------------------------
****BEFORE imputation ****
BAD              0          1
JOB                          
Mgr      76.662321  23.337679
Office   86.814346  13.185654
Other    76.800670  23.199330
ProfExe  83.385580  16.614420
Sales    65.137615  34.862385
Self     69.948187  30.051813
****AFTER imputation ****
BAD              0          1
JOB                          
Mgr      76.662321  23.337679
Office   86.814346  13.185654
Other    78.365204  21.634796
ProfExe  83.385580  16.614420
Sales    65.137615  34.862385
Self     69.948187  30.051813
BAD
***Checking distribution prior-imputation****
BAD
0    0.800503
1    0.199497
Name: proportion, dtype: float64
***Checking distribution post-imputation****
BAD
0    0.800503
1    0.199497
Name: proportion, dtype: float64
--------------------------------------------------
****BEFORE imputation ****
BAD      0      1
BAD              
0    100.0    0.0
1      0.0  100.0
****AFTER imputation ****
BAD      0      1
BAD              
0    100.0    0.0
1      0.0  100.0

Post-imputation, the share of the category used to fill missing values increases by up to ~3 percentage points, and the per-category default rates show de minimis change.

Feature Engineering

In [98]:
# Ratio of existing mortgage amount due to property value

hm['MORTDUE_VALUE_ratio']=hm['MORTDUE']/hm['VALUE']
hm['MORTDUE_VALUE_ratio_1'] = hm['MORTDUE_VALUE_ratio'].apply(lambda x: 1.01 if x >1 else x)


hm.describe().T
hm['MORTDUE_VALUE_ratio_1'].hist()
Out[98]:
<Axes: >
In [99]:
num_cols_1.append('MORTDUE_VALUE_ratio_1')
num_cols_1.remove('MORTDUE')
print(num_cols_1)
plt.figure(figsize=(10,10))
sns.heatmap(hm[num_cols_1].corr(), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral")
plt.title("Correlation Heatmap")
['BAD', 'LOAN', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'DELINQ_flag', 'DEROG_flag', 'YOJ_calc', 'DEBTINC_calc', 'MORTDUE_VALUE_ratio_1']
Out[99]:
Text(0.5, 1.0, 'Correlation Heatmap')
In [100]:
plt.figure(figsize = [8, 6])

sns.scatterplot(data = hm ,x = hm.MORTDUE_VALUE_ratio_1 , y =hm.DEBTINC	,hue='BAD')
plt.title('# Debt to Income vs Loan to Value ratio')
plt.show()

Loans with DTI above 50 and a high mortgage-due-to-value ratio have a higher concentration of defaults.

Important Insights from EDA¶

What are the most important observations and insights from the data based on the EDA performed?

Data checks:

  • Dataset has 5,960 rows and 13 columns. 11 variables are numeric and 2 are non-numeric.
  • Except for 'BAD' and 'LOAN', all variables have missing values: 'DEBTINC' has ~21% and 'DEROG' has ~12% missing; the remaining variables have less than 10% missing.

Univariate analysis:

-All numeric variables except the age of the oldest credit line and the number of credit lines have right-skewed distributions; all numeric variables have outliers.

Bivariate analysis:

-Average values of numeric variables were compared by the default variable 'BAD'; loans that default have the following characteristics:

  • ~12% lower mortgage amount due and ~4% lower property value
  • ~12% fewer years at the present job (~8 yrs vs. ~9 yrs for non-defaulters)
  • ~5x higher average numbers of derogatory reports and delinquent credit lines than non-defaulters
  • ~20% lower average age of oldest credit line and ~1.7x higher average number of inquiries
  • higher average debt-to-income ratio (~39 vs. ~33)
  • a similar number of existing tradelines to non-defaulters

-Loans for 'Home Improvement' have a marginally higher bad rate than those for 'Debt Consolidation'.

  • Job types 'Office' and 'ProfExe' have lower bad rates than the remaining job types; 'Sales' and 'Self' have the highest.

Multivariate analysis:

-There is a higher concentration of defaults when the debt-to-income ratio is greater than 50%.

-There is a higher concentration of defaults when years at the present job are fewer than 10 and the age of the oldest tradeline is below 200 months.

-Defaults are observed when there are 4 or more inquiries.

-The combination of 5+ inquiries and 20+ existing tradelines also shows higher defaults.

Outlier treatment and missing value treatment:

-Due to the highly skewed distributions of 'DELINQ', 'DEROG', 'DEBTINC' and 'YOJ', outlier treatment (capping) was applied to these variables to reduce the influence of extreme values.

-Missing values in numeric variables were imputed with mean values; only 'DEBTINC' slightly exceeded 20% missing.

-For categorical variables, missing values were imputed with the most frequently occurring category.
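One caveat (a best-practice note, not something done in this notebook): the imputation statistics above are computed on the full dataset before the train/test split, which leaks test information into training. A leakage-free alternative is to fit the imputer on the training partition only, e.g. with scikit-learn's `SimpleImputer`. A self-contained sketch on a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy numeric frame with missing values (illustrative only).
toy = pd.DataFrame({"DEBTINC": [30.0, np.nan, 40.0, 35.0, np.nan, 33.0],
                    "CLAGE":   [100.0, 150.0, np.nan, 180.0, 120.0, 200.0]})
y = pd.Series([0, 1, 0, 0, 1, 0])

X_tr, X_te, y_tr, y_te = train_test_split(toy, y, test_size=0.33, random_state=42)

imputer = SimpleImputer(strategy="mean")
X_tr_imp = imputer.fit_transform(X_tr)   # means computed on train only
X_te_imp = imputer.transform(X_te)       # train means reused on test

print(np.isnan(X_tr_imp).sum(), np.isnan(X_te_imp).sum())
```

With mostly low missing rates, the effect here is likely small, but the pattern generalizes to any preprocessing step.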

Model Building - Approach¶

  • Data preparation
  • Partition the data into train and test set
  • Build the model
  • Fit on the train data
  • Tune the model
  • Test the model on test set

Preparing data for modeling

Creating dummy variables for the categorical variables

In [101]:
#Create dummy variables for categorical variables
hm_tomodel=pd.get_dummies(hm,columns=cat_cols[0:2],drop_first=False)

hm_tomodel.head(5)
Out[101]:
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO ... MORTDUE_VALUE_ratio MORTDUE_VALUE_ratio_1 REASON_DebtCon REASON_HomeImp JOB_Mgr JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
0 1 1100 25860.0000 39025.000000 10.500000 0.00000 0.000000 94.366667 1.000000 9.000000 ... 0.662652 0.662652 False True False False True False False False
1 1 1300 70053.0000 68400.000000 7.000000 0.00000 2.000000 121.833333 0.000000 14.000000 ... 1.024167 1.010000 False True False False True False False False
2 1 1500 13500.0000 16700.000000 4.000000 0.00000 0.000000 149.466667 1.000000 10.000000 ... 0.808383 0.808383 False True False False True False False False
3 1 1500 73760.8172 101776.048741 8.922268 0.25457 0.449442 179.766275 1.186055 21.296096 ... 0.724736 0.724736 True False False False True False False False
4 0 1700 97800.0000 112000.000000 3.000000 0.00000 0.000000 93.333333 0.000000 14.000000 ... 0.873214 0.873214 False True False True False False False False

5 rows × 25 columns

Partition the data into train and test set

In [102]:
#Separating the dataset into independent variables and target
X= hm_tomodel.drop(['BAD','DELINQ','DEROG','MORTDUE_VALUE_ratio','YOJ','DEBTINC','MORTDUE','LOAN'],axis=1)

y=hm_tomodel['BAD']
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VALUE                  5960 non-null   float64
 1   CLAGE                  5960 non-null   float64
 2   NINQ                   5960 non-null   float64
 3   CLNO                   5960 non-null   float64
 4   DELINQ_flag            5960 non-null   float64
 5   DEROG_flag             5960 non-null   float64
 6   YOJ_calc               5960 non-null   float64
 7   DEBTINC_calc           5960 non-null   float64
 8   MORTDUE_VALUE_ratio_1  5960 non-null   float64
 9   REASON_DebtCon         5960 non-null   bool   
 10  REASON_HomeImp         5960 non-null   bool   
 11  JOB_Mgr                5960 non-null   bool   
 12  JOB_Office             5960 non-null   bool   
 13  JOB_Other              5960 non-null   bool   
 14  JOB_ProfExe            5960 non-null   bool   
 15  JOB_Sales              5960 non-null   bool   
 16  JOB_Self               5960 non-null   bool   
dtypes: bool(8), float64(9)
memory usage: 465.8 KB

Scaling the features

In [103]:
# Scaling the data to standardize the values with respect to the mean values for each variable


sc = StandardScaler()

X_scaled = sc.fit_transform(X)

X_scaled = pd.DataFrame(X_scaled, columns = X.columns)

Creating training and test dataset for modeling

In [104]:
X_train , X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.3,random_state=42)
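With only ~20% defaults, a plain random split can drift in class balance between train and test; passing `stratify=y` (an option not used above) keeps the default rate essentially identical in both partitions. A self-contained sketch on synthetic data with a ~20% positive rate:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"f1": rng.normal(size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.2).astype(int))  # ~20% positives

# stratify=y_demo preserves the class proportions in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

The same keyword works unchanged on `X_scaled` and `y` above.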

Function for Model Performance Comparison¶

In [105]:
#Function to evaluate model performance - precision, recall, accuracy, F1 score
from sklearn.metrics import confusion_matrix, classification_report,recall_score,precision_score, accuracy_score,f1_score
#Create function to measure performance
def measure_performance(X,y,clf,m_name,m_num,show_accuracy=True,show_classification_report=True,show_confusion_matrix=True):

    y_pred=clf.predict(X)
    recall = recall_score(y, y_pred,average = 'macro')                 # To compute recall
    precision = precision_score(y, y_pred, average = 'macro')          # To compute precision
    acc = accuracy_score(y, y_pred)
    f_score=f1_score(y, y_pred, average = 'macro')
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(acc),"\n")


    df_tmp = pd.DataFrame(
       {
            'num':m_num,
            "model_name": m_name,
            "Precision":  precision,
            "Recall":  recall,
            "Accuracy": acc,
            "F1_score":f_score


        },

       index = [m_num],
    )


    if show_classification_report:
        print("Classification report:")
        print(classification_report(y, y_pred))
    if show_confusion_matrix:
       cm = confusion_matrix(y, y_pred)
       print("Confusion Matrix")
       print(cm)
       # rows of confusion_matrix are actuals, columns are predictions
       sns.heatmap(cm, annot=True,fmt = '.0f', xticklabels = ['Predicted Non-Default', 'Predicted Default'], yticklabels = ['Actual Non-Default', 'Actual Default'])
       plt.show()



    return df_tmp

Logistic Regression

In [106]:
# Logistic regression
lg = LogisticRegression(random_state = 1, max_iter=1000)
lg.fit(X_train, y_train)
Out[106]:
LogisticRegression(max_iter=1000, random_state=1)
In [107]:
# Measure the performance of first Logistic Regression Model
df_perf_lg_a = measure_performance(X_train,y_train,lg,'Logistic Regression Train',1)
df_perf_lg_b = measure_performance(X_test,y_test,lg,'Logistic Regression Test',2)
print ("***********Model Performance****************")
print(df_perf_lg_a)
print(df_perf_lg_b)
Accuracy:0.843 

Classification report:
              precision    recall  f1-score   support

           0       0.85      0.97      0.91      3382
           1       0.71      0.28      0.40       790

    accuracy                           0.84      4172
   macro avg       0.78      0.63      0.66      4172
weighted avg       0.83      0.84      0.81      4172

Confusion Matrix
[[3293   89]
 [ 568  222]]
Accuracy:0.817 

Classification report:
              precision    recall  f1-score   support

           0       0.83      0.96      0.89      1389
           1       0.70      0.31      0.43       399

    accuracy                           0.82      1788
   macro avg       0.77      0.64      0.66      1788
weighted avg       0.80      0.82      0.79      1788

Confusion Matrix
[[1337   52]
 [ 276  123]]
***********Model Performance****************
   num                 model_name  Precision    Recall  Accuracy  F1_score
1    1  Logistic Regression Train   0.783357  0.627348  0.842522  0.656281
   num                model_name  Precision    Recall  Accuracy  F1_score
2    2  Logistic Regression Test   0.765874  0.635417  0.816555  0.659655
In [108]:
# Coefficients and Odds Ratios
coefficients =lg.coef_[0]
odds_ratios = np.exp(coefficients)
# Display feature importance using coefficients and odds ratios
lg_feature_importance= pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': coefficients,
    'Odds Ratio': odds_ratios
})
print("\nFeature Importance (Coefficient and Odds Ratio):")
print(lg_feature_importance.sort_values(by='Coefficient', ascending=False))
Feature Importance (Coefficient and Odds Ratio):
                  Feature  Coefficient  Odds Ratio
4             DELINQ_flag     0.784337    2.190953
5              DEROG_flag     0.453479    1.573777
7            DEBTINC_calc     0.315807    1.371365
2                    NINQ     0.275793    1.317575
15              JOB_Sales     0.201428    1.223148
10         REASON_HomeImp     0.083637    1.087234
13              JOB_Other     0.075669    1.078605
16               JOB_Self     0.073272    1.076023
0                   VALUE     0.057312    1.058986
14            JOB_ProfExe     0.016971    1.017116
11                JOB_Mgr    -0.024331    0.975962
9          REASON_DebtCon    -0.083637    0.919765
8   MORTDUE_VALUE_ratio_1    -0.130489    0.877667
6                YOJ_calc    -0.149173    0.861420
3                    CLNO    -0.165068    0.847836
12             JOB_Office    -0.208894    0.811481
1                   CLAGE    -0.487084    0.614415
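The odds ratios above are just the exponentiated coefficients. As a quick sanity check on the interpretation (a one-unit increase in a feature multiplies the odds of default by its odds ratio), the top coefficient from the table can be converted by hand:

```python
import math

# Coefficient for DELINQ_flag, taken from the table above
coef_delinq = 0.784337

# exp(coefficient) gives the odds ratio: presence of a delinquency
# roughly doubles the odds of default, all else held constant
odds_ratio = math.exp(coef_delinq)
print(round(odds_ratio, 4))  # matches the 2.190953 shown in the table
```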
In [109]:
rfe_model = LogisticRegression(max_iter=10000, solver='liblinear',random_state=1,class_weight='balanced')
rfe = RFE(rfe_model, n_features_to_select=8)
rfe.fit(X_train, y_train)
rfe_features = X.columns[rfe.support_]
print("\nSelected Features by RFE:")
print('Features:', rfe_features)
print('Ranking:', rfe.ranking_)
df_perf_rfe_a = measure_performance(X_train,y_train,rfe,'Logistic Regression w. RFE Train',3)
df_perf_rfe_b = measure_performance(X_test,y_test,rfe,'Logistic Regression w. RFE Test',4)
print ("***********Model Performance****************")
print(df_perf_rfe_a)
print(df_perf_rfe_b)

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_Series = pd.Series(
    [variance_inflation_factor(X_train[rfe_features].values, i)
     for i in range(X_train[rfe_features].shape[1])],
    index=X_train[rfe_features].columns)
print(vif_Series)
Selected Features by RFE:
Features: Index(['CLAGE', 'NINQ', 'CLNO', 'DELINQ_flag', 'DEROG_flag', 'DEBTINC_calc',
       'JOB_Office', 'JOB_Sales'],
      dtype='object')
Ranking: [ 8  1  1  1  1  1  3  1  4  2  5  9  1  6 10  1  7]
Accuracy:0.756 

Classification report:
              precision    recall  f1-score   support

           0       0.90      0.79      0.84      3382
           1       0.41      0.63      0.49       790

    accuracy                           0.76      4172
   macro avg       0.65      0.71      0.67      4172
weighted avg       0.81      0.76      0.77      4172

Confusion Matrix
[[2656  726]
 [ 293  497]]
Accuracy:0.731 

Classification report:
              precision    recall  f1-score   support

           0       0.87      0.77      0.82      1389
           1       0.43      0.61      0.50       399

    accuracy                           0.73      1788
   macro avg       0.65      0.69      0.66      1788
weighted avg       0.77      0.73      0.75      1788

Confusion Matrix
[[1063  326]
 [ 155  244]]
***********Model Performance****************
   num                        model_name  Precision    Recall  Accuracy  \
3    3  Logistic Regression w. RFE Train   0.653511  0.707224  0.755753   

   F1_score  
3  0.666418  
   num                       model_name  Precision    Recall  Accuracy  \
4    4  Logistic Regression w. RFE Test   0.650406  0.688414  0.730984   

   F1_score  
4  0.659554  
CLAGE           1.080902
NINQ            1.067755
CLNO            1.128125
DELINQ_flag     1.058822
DEROG_flag      1.072238
DEBTINC_calc    1.050084
JOB_Office      1.012306
JOB_Sales       1.011224
dtype: float64
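A common rule of thumb flags VIF values above 5 as problematic multicollinearity; all of the RFE-selected features sit near 1, so they are essentially independent. A quick check over the printed values (hard-coded here purely for illustration):

```python
# VIF values copied from the output above
vifs = {
    'CLAGE': 1.080902, 'NINQ': 1.067755, 'CLNO': 1.128125,
    'DELINQ_flag': 1.058822, 'DEROG_flag': 1.072238,
    'DEBTINC_calc': 1.050084, 'JOB_Office': 1.012306, 'JOB_Sales': 1.011224,
}
# Features exceeding the usual VIF > 5 threshold
high_vif = [feature for feature, vif in vifs.items() if vif > 5]
print(high_vif)  # → [] — no multicollinearity concerns
```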
In [110]:
# Added post-Milestone Submission
# Identify the importance of features in the RFE model
rfe_coefficients = rfe.estimator_.coef_[0]
rfe_odds_ratios = np.exp(rfe_coefficients)
rfe_feature_importance= pd.DataFrame({
    'Feature': X_train[rfe_features].columns,
    'Coefficient': rfe_coefficients,
    'Odds Ratio': rfe_odds_ratios})
print("\nFeature Importance (Coefficient and Odds Ratio) for RFE selected features:")
print(rfe_feature_importance.sort_values(by='Coefficient', ascending=False))
Feature Importance (Coefficient and Odds Ratio) for RFE selected features:
        Feature  Coefficient  Odds Ratio
3   DELINQ_flag     0.780168    2.181838
4    DEROG_flag     0.508693    1.663116
5  DEBTINC_calc     0.326577    1.386214
1          NINQ     0.292439    1.339691
7     JOB_Sales     0.195805    1.216290
6    JOB_Office    -0.247626    0.780652
2          CLNO    -0.250977    0.778040
0         CLAGE    -0.423836    0.654531

Decision Tree¶

In [111]:
# Building decision tree model
# Model #2
dt = DecisionTreeClassifier(random_state = 1,max_depth=3,class_weight='balanced')
dt.fit(X_train, y_train)
Out[111]:
DecisionTreeClassifier(class_weight='balanced', max_depth=3, random_state=1)
In [112]:
# Measure the performance of first Decision tree Model
df_perf_dt_a = measure_performance(X_train,y_train,dt,'Decision Tree Train',5)
df_perf_dt_b = measure_performance(X_test,y_test,dt,'Decision Tree Test',6)
print ("***********Model Performance****************")
print(df_perf_dt_a)
print(df_perf_dt_b)
Accuracy:0.864 

Classification report:
              precision    recall  f1-score   support

           0       0.94      0.89      0.91      3382
           1       0.62      0.74      0.67       790

    accuracy                           0.86      4172
   macro avg       0.78      0.82      0.79      4172
weighted avg       0.88      0.86      0.87      4172

Confusion Matrix
[[3016  366]
 [ 203  587]]
Accuracy:0.869 

Classification report:
              precision    recall  f1-score   support

           0       0.93      0.90      0.91      1389
           1       0.68      0.76      0.72       399

    accuracy                           0.87      1788
   macro avg       0.81      0.83      0.82      1788
weighted avg       0.88      0.87      0.87      1788

Confusion Matrix
[[1248  141]
 [  94  305]]
***********Model Performance****************
   num           model_name  Precision    Recall  Accuracy  F1_score
5    5  Decision Tree Train   0.776443  0.817409  0.863615  0.793676
   num          model_name  Precision   Recall  Accuracy  F1_score
6    6  Decision Tree Test   0.806906  0.83145  0.868568  0.817922
In [113]:
# Building decision tree model using GridSearch
dt_1 = DecisionTreeClassifier(random_state = 1)
param_grid = {'max_depth' : np.arange(2,7,1),
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf' : [3,5,7]  }
#use Grid Search
gridSearch = GridSearchCV(dt_1,param_grid, scoring='recall_weighted',cv=5)
gridSearch.fit(X_train,y_train)

best_params = gridSearch.best_params_
best_recall = gridSearch.best_score_


# print the best parameters for max depth and min number of nodes
print(best_params)

print(best_recall )
{'criterion': 'gini', 'max_depth': np.int64(6), 'min_samples_leaf': 3}
0.8787119286606643
In [114]:
dt_1 = DecisionTreeClassifier(random_state = 1, max_depth =best_params['max_depth'], min_samples_leaf = best_params['min_samples_leaf'],criterion=best_params['criterion'])
dt_1.fit(X_train, y_train)
Out[114]:
DecisionTreeClassifier(max_depth=np.int64(6), min_samples_leaf=3,
                       random_state=1)
In [115]:
# Measure the performance of the Decision Tree model tuned with Grid Search

df_perf_dt1_a = measure_performance(X_train,y_train,dt_1,'DecisionTree_train_GV',7)
df_perf_dt1_b = measure_performance(X_test,y_test,dt_1,'DecisionTree_test_GV',8)
print ("************Model Performance****************")
print(df_perf_dt1_a)
print(df_perf_dt1_b)
Accuracy:0.895 

Classification report:
              precision    recall  f1-score   support

           0       0.93      0.94      0.94      3382
           1       0.73      0.71      0.72       790

    accuracy                           0.90      4172
   macro avg       0.83      0.82      0.83      4172
weighted avg       0.89      0.90      0.89      4172

Confusion Matrix
[[3176  206]
 [ 232  558]]
Accuracy:0.878 

Classification report:
              precision    recall  f1-score   support

           0       0.91      0.94      0.92      1389
           1       0.76      0.66      0.71       399

    accuracy                           0.88      1788
   macro avg       0.83      0.80      0.81      1788
weighted avg       0.87      0.88      0.87      1788

Confusion Matrix
[[1307   82]
 [ 137  262]]
************Model Performance****************
   num             model_name  Precision    Recall  Accuracy  F1_score
7    7  DecisionTree_train_GV   0.831146  0.822709  0.895014   0.82682
   num            model_name  Precision    Recall  Accuracy  F1_score
8    8  DecisionTree_test_GV   0.833376  0.798803  0.877517  0.813973
In [116]:
# combined performance from all the decision tree models
df_perf_combd_1 = pd.merge(df_perf_lg_a,df_perf_lg_b,how='outer')
df_perf_combd_1a = pd.merge(df_perf_rfe_a ,df_perf_rfe_b,how='outer')
df_perf_combd_2 = pd.merge(df_perf_dt_a ,df_perf_dt_b,how='outer')
df_perf_combd_3 = pd.merge(df_perf_dt1_a ,df_perf_dt1_b,how='outer')
df_perf_combd_t= pd.merge(df_perf_combd_1,df_perf_combd_1a,how='outer')
df_perf_combd_t1= pd.merge(df_perf_combd_t,df_perf_combd_2,how='outer')
df_perf_combd_all = pd.merge(df_perf_combd_t1,df_perf_combd_3,how='outer')
df_perf_combd_all
Out[116]:
num model_name Precision Recall Accuracy F1_score
0 1 Logistic Regression Train 0.783357 0.627348 0.842522 0.656281
1 2 Logistic Regression Test 0.765874 0.635417 0.816555 0.659655
2 3 Logistic Regression w. RFE Train 0.653511 0.707224 0.755753 0.666418
3 4 Logistic Regression w. RFE Test 0.650406 0.688414 0.730984 0.659554
4 5 Decision Tree Train 0.776443 0.817409 0.863615 0.793676
5 6 Decision Tree Test 0.806906 0.831450 0.868568 0.817922
6 7 DecisionTree_train_GV 0.831146 0.822709 0.895014 0.826820
7 8 DecisionTree_test_GV 0.833376 0.798803 0.877517 0.813973
In [117]:
# Logistic Regression and Decision Tree models - Feature Importance comparison
columns = X.columns
importance_dt_md3 = dt.feature_importances_
importance_dt_gv = dt_1.feature_importances_
lg_feature_importance_1 = lg_feature_importance.sort_values(by='Odds Ratio', ascending=False)
rfe_feature_importance_1 = rfe_feature_importance.sort_values(by='Odds Ratio', ascending=False)
#importance_df_lg_rfe = pd.DataFrame(, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
importance_df_md_3 = pd.DataFrame(importance_dt_md3, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
importance_df_gv = pd.DataFrame(importance_dt_gv, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

fig, axes = plt.subplots(2, 2, figsize=(15,10))  # Adjust figsize for better spacing

sns.barplot(x=lg_feature_importance_1['Odds Ratio'], y=lg_feature_importance_1.Feature, ax=axes[0,0])
axes[0,0].set_title('**Feature Importance - Logistic Regression (Odds Ratio)**')

sns.barplot(x=rfe_feature_importance_1['Odds Ratio'] ,y=rfe_feature_importance_1.Feature, ax=axes[0,1])
axes[0,1].set_title('**Feature Importance - Logistic Regression w Recursive Feature Elimination (Odds Ratio)**')

sns.barplot(x= importance_df_md_3.Importance, y=importance_df_md_3.index,ax=axes[1,0])
axes[1,0].set_title('**Feature Importance - Decision Tree Model - Max depth 3**')

sns.barplot(x=importance_df_gv.Importance,y=importance_df_gv.index,ax=axes[1,1])
axes[1,1].set_title('**Feature Importance - Decision Tree Model - Grid Search Max Depth =6 , Min sample leaf = 3**')

plt.tight_layout()
plt.show()

Building a Random Forest Classifier¶

Random Forest is a bagging (bootstrap aggregating) algorithm whose base models are decision trees. Each tree is trained on a bootstrap sample drawn from the training data and makes its own prediction.

The predictions from all the decision trees are then combined into a final prediction by majority voting (for classification) or averaging (for regression).
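The bootstrap-and-vote scheme described above can be sketched by hand. This is a toy illustration on synthetic data, not the notebook's pipeline; a real `RandomForestClassifier` additionally considers a random subset of features at each split, which decorrelates the trees further.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the loan dataset
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, random_state=1)

rng = np.random.default_rng(1)
votes = []
for _ in range(25):
    # Each tree is fit on a bootstrap sample (sampling rows with replacement)
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    tree = DecisionTreeClassifier(random_state=1).fit(Xtr[idx], ytr[idx])
    votes.append(tree.predict(Xte))

# Majority vote across the 25 trees gives the ensemble prediction
ensemble_pred = (np.mean(votes, axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (ensemble_pred == yte).mean())
```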

In [118]:
# Base RandomForestClassifier

rf_estimator = RandomForestClassifier( random_state = 1,class_weight = 'balanced')

rf_estimator.fit(X_train, y_train)
Out[118]:
RandomForestClassifier(class_weight='balanced', random_state=1)
In [119]:
# Model Performance evaluation - Base random forest classifier
df_perf_rf1_a = measure_performance(X_train,y_train,rf_estimator,'randomForest_train',9)
df_perf_rf1_b = measure_performance(X_test,y_test,rf_estimator,'randomForest_test',10)
print ("************Model Performance****************")
print(df_perf_rf1_a)
print(df_perf_rf1_b)
Accuracy:1.000 

Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3382
           1       1.00      1.00      1.00       790

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

Confusion Matrix
[[3382    0]
 [   0  790]]
Accuracy:0.897 

Classification report:
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      1389
           1       0.89      0.62      0.73       399

    accuracy                           0.90      1788
   macro avg       0.89      0.80      0.83      1788
weighted avg       0.90      0.90      0.89      1788

Confusion Matrix
[[1358   31]
 [ 153  246]]
************Model Performance****************
   num          model_name  Precision  Recall  Accuracy  F1_score
9    9  randomForest_train        1.0     1.0       1.0       1.0
    num         model_name  Precision    Recall  Accuracy  F1_score
10   10  randomForest_test   0.893415  0.797112  0.897092  0.832181

Random Forest Classifier Hyperparameter Tuning¶

In [120]:
# Random Forest classifier initialization for hyperparameter tuning
rf_estimator_tuned = RandomForestClassifier(random_state = 1)

# Grid of parameters to choose from
params_rf = {
        "n_estimators": [5,7,9],
        "min_samples_leaf": np.arange(2,7,1),
        "max_features": [0.2, 0.5, 0.7, 'sqrt'],  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its classifier equivalent
}

import sklearn.metrics as metrics
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
print(grid_obj.best_params_)
{'max_features': 0.7, 'min_samples_leaf': np.int64(4), 'n_estimators': 7}
In [121]:
# Random Forest classifier with the best parameters identified through GridSearch
rf_estimator_tuned = RandomForestClassifier(random_state = 1,
                                            n_estimators = grid_obj.best_params_['n_estimators'],
                                            min_samples_leaf = grid_obj.best_params_['min_samples_leaf'],
                                            max_features = grid_obj.best_params_['max_features'])
rf_estimator_tuned.fit(X_train, y_train)
Out[121]:
RandomForestClassifier(max_features=0.7, min_samples_leaf=np.int64(4),
                       n_estimators=7, random_state=1)
In [122]:
df_perf_rf2_a = measure_performance(X_train,y_train,rf_estimator_tuned,'randomForest_train_w_GV',11)
df_perf_rf2_b = measure_performance(X_test,y_test,rf_estimator_tuned,'randomForest_test_GV',12)
print ("************Model Performance****************")
print(df_perf_rf2_a)
print(df_perf_rf2_b)
Accuracy:0.951 

Classification report:
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      3382
           1       0.93      0.79      0.86       790

    accuracy                           0.95      4172
   macro avg       0.94      0.89      0.91      4172
weighted avg       0.95      0.95      0.95      4172

Confusion Matrix
[[3338   44]
 [ 162  628]]
Accuracy:0.883 

Classification report:
              precision    recall  f1-score   support

           0       0.90      0.96      0.93      1389
           1       0.81      0.62      0.70       399

    accuracy                           0.88      1788
   macro avg       0.86      0.79      0.81      1788
weighted avg       0.88      0.88      0.88      1788

Confusion Matrix
[[1332   57]
 [ 152  247]]
************Model Performance****************
    num               model_name  Precision    Recall  Accuracy  F1_score
11   11  randomForest_train_w_GV   0.944119  0.890963  0.950623  0.914582
    num            model_name  Precision    Recall  Accuracy  F1_score
12   12  randomForest_test_GV   0.855037  0.789005   0.88311  0.814978
In [123]:
# combined performance from Random Forest classifier results
df_perf_rf_1 = pd.merge(df_perf_rf1_a,df_perf_rf1_b,how='outer')
df_perf_rf_2 = pd.merge(df_perf_rf2_a,df_perf_rf2_b,how='outer')
df_perf_combd_rf_all = pd.merge(df_perf_rf_1,df_perf_rf_2,how='outer')
# combining Decision tree and Random forest classifier results

df_perf_combd_dt_rf = pd.merge(df_perf_combd_all,df_perf_combd_rf_all,how='outer')
df_perf_combd_dt_rf
Out[123]:
num model_name Precision Recall Accuracy F1_score
0 1 Logistic Regression Train 0.783357 0.627348 0.842522 0.656281
1 2 Logistic Regression Test 0.765874 0.635417 0.816555 0.659655
2 3 Logistic Regression w. RFE Train 0.653511 0.707224 0.755753 0.666418
3 4 Logistic Regression w. RFE Test 0.650406 0.688414 0.730984 0.659554
4 5 Decision Tree Train 0.776443 0.817409 0.863615 0.793676
5 6 Decision Tree Test 0.806906 0.831450 0.868568 0.817922
6 7 DecisionTree_train_GV 0.831146 0.822709 0.895014 0.826820
7 8 DecisionTree_test_GV 0.833376 0.798803 0.877517 0.813973
8 9 randomForest_train 1.000000 1.000000 1.000000 1.000000
9 10 randomForest_test 0.893415 0.797112 0.897092 0.832181
10 11 randomForest_train_w_GV 0.944119 0.890963 0.950623 0.914582
11 12 randomForest_test_GV 0.855037 0.789005 0.883110 0.814978
In [124]:
# Random Forest Classifier - Feature Importance
importances_rf = rf_estimator.feature_importances_
importance_df_rf = pd.DataFrame(importances_rf, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)


# Random Forest Classifier with Tuning - Feature Importance
importances_rf2 = rf_estimator_tuned.feature_importances_
importance_df_rf2 = pd.DataFrame(importances_rf2, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)


# Importance Features from each of Decision Tree and Random Forest Classifier

fig, axes = plt.subplots(3, 2, figsize=(15, 15))  # Adjust figsize for better spacing
sns.barplot(x=lg_feature_importance_1['Odds Ratio'], y=lg_feature_importance_1.Feature, ax=axes[0,0])
axes[0,0].set_title('**Feature Importance - Logistic Regression (Odds Ratio)**')

sns.barplot(x=rfe_feature_importance_1['Odds Ratio'] ,y=rfe_feature_importance_1.Feature, ax=axes[0,1])
axes[0,1].set_title('**Feature Importance - Logistic Regression w Recursive Feature Elimination (Odds Ratio)**')

axes[0,1].set_ylabel('Features')

sns.barplot(x= importance_df_md_3.Importance, y=importance_df_md_3.index,ax=axes[1,0])
axes[1,0].set_title('**Feature Importance - Decision Tree Classifier  - Max depth =3**')

sns.barplot(x=importance_df_gv.Importance,y=importance_df_gv.index,ax=axes[1,1])
axes[1,1].set_title('**Feature Importance -Decision Tree Classifier - Grid Search Max Depth =6 , Min sample leaf=3**')

sns.barplot(x= importance_df_rf.Importance, y=importance_df_rf.index,ax=axes[2,0])
axes[2,0].set_title('**Feature Importance -Random Forest Classifier**')
sns.barplot(x= importance_df_rf2.Importance, y=importance_df_rf2.index,ax=axes[2,1]);
axes[2,1].set_title('**Feature Importance - Random Forest Model - Grid Search**')



plt.tight_layout()
plt.show()

Insights¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

Three types of classification models, namely Logistic Regression, Decision Tree, and Random Forest classifiers, were trained and validated.

  • The Logistic Regression model has a training accuracy of 84% and (macro) recall of 63%. Performance held up on the test data (82% accuracy, 64% recall), but recall for the defaulter class itself was only ~28-31%, so most defaults are missed.
  • Logistic Regression with Recursive Feature Elimination has a training accuracy of 76% and recall of 71%, and validates well on the test dataset (73% accuracy, 69% recall) while catching far more defaulters.
  • The base Decision Tree classifier with max depth of 3 has a training accuracy of 86% and recall of 82%; test performance was comparable (87% accuracy, 83% recall).
  • The Decision Tree classifier tuned with grid search and cross-validation improved training accuracy to 90% with recall of 82%. On the test data, accuracy was 88% and recall dropped to 80%.
  • The base Random Forest classifier shows significant overfitting (perfect scores on the training data), and even after hyperparameter tuning the train/test gap remains (95% vs. 88% accuracy). The tuned Random Forest captures defaults well but overfits the training data; by comparison, Logistic Regression with Recursive Feature Elimination and the grid-searched Decision Tree both validate well on the test dataset, with the Decision Tree offering the better balance of accuracy and recall.

2. Refined insights:

  • What are the most meaningful insights relevant to the problem?

The Logistic Regression model with Recursive Feature Elimination for classifying defaulting vs. non-defaulting loans shows good overall performance, with accuracy of ~73-76%, recall of ~69-71%, and test performance comparable to the training dataset. Loans with the following attributes have a higher likelihood of default:

  • Presence of delinquency
  • Presence of prior derogatory behavior
  • Higher debt-to-income ratio
  • Higher number of recent credit inquiries
  • Lower age of the oldest trade line (CLAGE)
  • Lower number of existing trade lines (CLNO)
  • Job type of Sales

3. Proposal for the final solution design:

  • What model do you propose to be adopted? Why is this the best solution to adopt?

Given the lower accuracy and recall of the logistic regression models and the overfitting seen in the tree-based models, I propose to:

  1. Further adjust the hyperparameters of the decision tree and random forest models, and refine the logistic regression model.
  2. Evaluate additional model performance metrics, and assess how effectively each model provides transparency for regulatory purposes and improves underwriting efficiency.
In [125]:
#! jupyter nbconvert --to html  '/content/drive/My Drive/GL_MIT_ADSP/Week9_Capstone/ChetanaShah_Capstone_Project_Notebook_Loan_Default_Prediction_Full_Code.ipynb'

For Final Submission¶

1) Executive Summary¶

  • What are the important findings from the analysis done in Milestone?

    • The analysis is based on existing loans data with 5,690 observations and 13 variables.
    • The bad rate for the dataset is 19.9%.
    • The data has 11 variables with missing values. All the numeric variables showed right-skewed distributions and the presence of outliers.
    • For numeric variables, the mean of each variable was used to impute missing values; for categorical variables, the mode was used for imputation.
      • For highly skewed variables like
  • Final proposed model specifications:
    • Logistic Regression, Decision Tree, and Random Forest techniques were used to solve the classification problem.
    • The Logistic Regression model had lower accuracy (~74%) and recall of ~71%.
    • Models based on Decision Tree and Random Forest had higher accuracy but were overfitting the training data.
    • The Decision Tree with hyperparameter tuning improved accuracy but still showed some overfitting.

    • Given the need for interpretability of the model, a Decision Tree based model is proposed as the final solution.
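The mean/mode imputation strategy summarized above can be sketched in a few lines of pandas. This is a toy illustration with made-up values (only the column names `MORTDUE` and `JOB` come from the dataset), not the notebook's actual preprocessing code:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: numeric columns get the mean, categorical the mode
toy = pd.DataFrame({
    "MORTDUE": [25000.0, np.nan, 40000.0, 35000.0],
    "JOB": ["Office", np.nan, "Sales", "Office"],
})
toy["MORTDUE"] = toy["MORTDUE"].fillna(toy["MORTDUE"].mean())  # mean imputation
toy["JOB"] = toy["JOB"].fillna(toy["JOB"].mode()[0])           # mode imputation
print(toy)
```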

Brief summary of Data exploration

In [126]:
# Missing Rate
tmp = pd.DataFrame(data.isna().sum()/data.shape[0]*100, columns=['Missing_Rate'])
tmp.index.name = 'Feature'
tmp = tmp.sort_values(by='Missing_Rate', ascending=False)  # assign back: sort_values returns a copy
plt.figure(figsize=(7,5))
mv2 = sns.barplot(data=tmp[tmp['Missing_Rate'] > 0], x='Feature', y='Missing_Rate')
for p in mv2.patches:
    mv2.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 5), textcoords='offset points')
plt.xticks(rotation=90)
plt.title('Missing Rate by Feature (%)')
plt.show()
# Key: number of loans and bad rate
In [127]:
fig, axes = plt.subplots(2, 2, figsize=(12,10))  # Adjust figsize for better spacing
sns.barplot(data=hm['JOB'].value_counts(normalize=True), ax=axes[0,0])
axes[0,0].set_title('**Job Type**')
(pd.crosstab(hm['JOB'], hm['BAD'], normalize = 'index')*100).plot(kind = 'bar', stacked = True, ax=axes[0,1], rot=0)

axes[0,1].set_title('**Job Type vs Bad Rate**')

sns.barplot(data=hm['REASON'].value_counts(normalize=True), ax=axes[1,0])
axes[1,0].set_title('**Loan Reason**')
(pd.crosstab(hm['REASON'], hm['BAD'], normalize = 'index')*100).plot(kind = 'bar', stacked = True, ax=axes[1,1], rot=0)
axes[1,1].set_title('**Reason vs Bad Rate**')
plt.show()

Final Proposed Model Solution

In [128]:
# Feature Importance for Decision Tree Model with Hyper parameter tuning
sns.barplot(x=importance_df_gv.Importance,y=importance_df_gv.index)
plt.title('**Feature Importance -Decision Tree Classifier - Grid Search Max Depth =6 , Min sample leaf=3**')

plt.figure(figsize=(20,10))

plot_tree(dt_1,feature_names=X_train.columns,filled=True,fontsize=12)
plt.tight_layout()
plt.show()
In [129]:
# ROC comparison using predicted probabilities (predict_proba traces the full
# curve; hard predict() would yield only a single threshold point)

# Logistic Regression with RFE
fpr1, tpr1, _ = roc_curve(y_test, rfe.predict_proba(X_test)[:, 1])
roc_auc1 = auc(fpr1, tpr1)

# Decision Tree, max depth 3
fpr2, tpr2, _ = roc_curve(y_test, dt.predict_proba(X_test)[:, 1])
roc_auc2 = auc(fpr2, tpr2)

# Decision Tree w/ Grid Search (max depth and min samples per leaf)
fpr3, tpr3, _ = roc_curve(y_test, dt_1.predict_proba(X_test)[:, 1])
roc_auc3 = auc(fpr3, tpr3)

# Random Forest w/ Grid Search
fpr4, tpr4, _ = roc_curve(y_test, rf_estimator_tuned.predict_proba(X_test)[:, 1])
roc_auc4 = auc(fpr4, tpr4)

plt.figure(figsize=(7,7))
plt.plot(fpr1, tpr1, color='orange', lw=2, label='Logistic Regression  ROC  (area = %0.2f)' % roc_auc1)
plt.plot(fpr2, tpr2, color='lightblue', lw=2, label='Decision Tree Max depth 3 ROC  (area = %0.2f)' % roc_auc2)
plt.plot(fpr3, tpr3, color='violet', lw=2, label='Decision Tree Grid Search (area = %0.2f)' % roc_auc3)
plt.plot(fpr4, tpr4, color='navy', lw=2, label='Random Forest (area = %0.2f)' % roc_auc4)
plt.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--') # Random classifier line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic  Comparison')
plt.legend(loc="lower right")
plt.show()

Final Model Solution¶

In [130]:
# Refining the Decision Tree model with Grid Search over combinations of max depth and max leaf nodes, fixing criterion='gini' and min_samples_leaf=3, to improve accuracy and reduce overfitting
dt_3 = DecisionTreeClassifier(random_state = 1)
param_grid_2 = {'max_depth' : np.arange(2,10,2),
                'criterion': ['gini'],
                'max_leaf_nodes' :np.arange(3,21,2),
                'min_samples_leaf':[3]}

#use Grid Search
gridSearch_2 = GridSearchCV(dt_3,param_grid_2, scoring='recall_weighted',cv=5)
gridSearch_2.fit(X_train,y_train)

best_params_2 = gridSearch_2.best_params_
best_recall_2 = gridSearch_2.best_score_


# print the best parameters for max depth and min number of nodes
print(best_params_2)

print(best_recall_2 )
{'criterion': 'gini', 'max_depth': np.int64(6), 'max_leaf_nodes': np.int64(19), 'min_samples_leaf': 3}
0.88254972070248
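Why capping `max_leaf_nodes` helps: it limits the tree's capacity, trading a little training accuracy for a smaller train/test gap. A toy sketch on synthetic data (not the notebook's pipeline) comparing an unconstrained tree with one capped at 19 leaves, as in the grid-search result above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, random_state=1)

# Unconstrained tree: grows until leaves are pure, memorizing the training data
full = DecisionTreeClassifier(random_state=1).fit(Xtr, ytr)
# Capped tree: at most 19 leaves and 3 samples per leaf
capped = DecisionTreeClassifier(max_leaf_nodes=19, min_samples_leaf=3,
                                random_state=1).fit(Xtr, ytr)

for name, m in [("full", full), ("capped", capped)]:
    gap = m.score(Xtr, ytr) - m.score(Xte, yte)
    print(f"{name}: train={m.score(Xtr, ytr):.3f} test={m.score(Xte, yte):.3f} gap={gap:.3f}")
```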
In [131]:
# Decision Tree model based on best parameters identified by Grid Search

dt_3 = DecisionTreeClassifier(random_state = 1, max_depth =best_params_2['max_depth'], max_leaf_nodes = best_params_2['max_leaf_nodes'],min_samples_leaf= best_params_2['min_samples_leaf'],criterion=best_params_2['criterion'])
dt_3.fit(X_train, y_train)



df_perf_dt3_a = measure_performance(X_train,y_train,dt_3 ,'Decision Tree GV Refined Train',13)
df_perf_dt3_b = measure_performance(X_test,y_test,dt_3 ,'Decision Tree GV Refined Test',14)
print ("************Model Performance****************")
print(df_perf_dt3_a)
print(df_perf_dt3_b)
Accuracy:0.892 

Classification report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93      3382
           1       0.71      0.72      0.72       790

    accuracy                           0.89      4172
   macro avg       0.82      0.83      0.83      4172
weighted avg       0.89      0.89      0.89      4172

Confusion Matrix
[[3155  227]
 [ 222  568]]
Accuracy:0.883 

Classification report:
              precision    recall  f1-score   support

           0       0.91      0.94      0.93      1389
           1       0.76      0.68      0.72       399

    accuracy                           0.88      1788
   macro avg       0.84      0.81      0.82      1788
weighted avg       0.88      0.88      0.88      1788

Confusion Matrix
[[1305   84]
 [ 126  273]]
************Model Performance****************
    num                      model_name  Precision    Recall  Accuracy  F1_score
13   13  Decision Tree GV Refined Train   0.824363  0.825934  0.892378  0.825145
    num                     model_name  Precision    Recall  Accuracy  F1_score
14   14  Decision Tree GV Refined Test   0.838328  0.811868   0.88255  0.823877
In [132]:
# combining the refined Decision Tree results with all earlier model results
df_perf_dt3_combd = pd.merge(df_perf_dt3_a,df_perf_dt3_b,how='outer')
df_perf_combd_all = pd.merge(df_perf_combd_dt_rf ,df_perf_dt3_combd,how='outer')
df_perf_combd_all
Out[132]:
num model_name Precision Recall Accuracy F1_score
0 1 Logistic Regression Train 0.783357 0.627348 0.842522 0.656281
1 2 Logistic Regression Test 0.765874 0.635417 0.816555 0.659655
2 3 Logistic Regression w. RFE Train 0.653511 0.707224 0.755753 0.666418
3 4 Logistic Regression w. RFE Test 0.650406 0.688414 0.730984 0.659554
4 5 Decision Tree Train 0.776443 0.817409 0.863615 0.793676
5 6 Decision Tree Test 0.806906 0.831450 0.868568 0.817922
6 7 DecisionTree_train_GV 0.831146 0.822709 0.895014 0.826820
7 8 DecisionTree_test_GV 0.833376 0.798803 0.877517 0.813973
8 9 randomForest_train 1.000000 1.000000 1.000000 1.000000
9 10 randomForest_test 0.893415 0.797112 0.897092 0.832181
10 11 randomForest_train_w_GV 0.944119 0.890963 0.950623 0.914582
11 12 randomForest_test_GV 0.855037 0.789005 0.883110 0.814978
12 13 Decision Tree GV Refined 0.824363 0.825934 0.892378 0.825145
13 14 Decision Tree GV Refined 0.838328 0.811868 0.882550 0.823877

ROC Curve¶

In [133]:
# Logistic Regression model
fpr1, tpr1, _ = roc_curve(y_test, rfe.predict(X_test))
roc_auc1 = auc(fpr1, tpr1)

# Decision tree, max depth 3
fpr2, tpr2, _ = roc_curve(y_test, dt.predict(X_test))
roc_auc2 = auc(fpr2, tpr2)

# Decision tree w/ grid search (max depth and min samples per leaf)
fpr3, tpr3, _ = roc_curve(y_test, dt_1.predict(X_test))
roc_auc3 = auc(fpr3, tpr3)

# Random forest w/ grid search
fpr4, tpr4, _ = roc_curve(y_test, rf_estimator_tuned.predict(X_test))
roc_auc4 = auc(fpr4, tpr4)

# Decision tree w/ max depth and max leaf nodes
fpr5, tpr5, _ = roc_curve(y_test, dt_3.predict(X_test))
roc_auc5 = auc(fpr5, tpr5)

plt.figure(figsize=(7,7))
plt.plot(fpr1, tpr1, color='orange', lw=2, label='Logistic Regression  ROC  (area = %0.2f)' % roc_auc1)
plt.plot(fpr2, tpr2, color='lightblue', lw=2, label='Decision Tree Max depth 3 ROC  (area = %0.2f)' % roc_auc2)
plt.plot(fpr3, tpr3, color='violet', lw=2, label='Decision Tree Grid Search (area = %0.2f)' % roc_auc3)
plt.plot(fpr4, tpr4, color='navy', lw=2, label='Random Forest Grid Search  (area = %0.2f)' % roc_auc4)
plt.plot(fpr5, tpr5, color='red', lw=2, label='Decision Tree Grid Search Refined (area = %0.2f)' % roc_auc5)
plt.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--') # Random classifier line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic  Comparison w/ Final Model Solution')
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve comparison across models]
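Note: the curves above are built from hard class labels returned by `.predict()`, which gives only a single operating point per model. A smoother, more informative ROC comes from predicted probabilities. A minimal sketch on a toy dataset (in the notebook, the fitted estimators such as `dt_3` also expose `predict_proba` and could be substituted directly):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

# Toy stand-in for the notebook's X_train/X_test split
X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

# predict_proba[:, 1] is the probability of the positive (default) class;
# roc_curve then sweeps every threshold instead of the implicit 0.5
scores = model.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)
print("AUC from probabilities: %.3f" % roc_auc)
```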
In [134]:
# Plot the decision tree tuned via grid search (max depth 5, 19 max leaf nodes, at least 3 samples per leaf)
plt.figure(figsize=(25,10))
plot_tree(dt_3 ,feature_names=X_train.columns,filled=True,fontsize=12)
plt.tight_layout()
plt.show()

# Feature importance plot

importance_dt3_gv =dt_3.feature_importances_
importance_df_dt3_gv = pd.DataFrame(importance_dt3_gv, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

sns.barplot(x=importance_df_dt3_gv.Importance,y=importance_df_dt3_gv.index)
[Figure: plot of the refined decision tree]
Out[134]:
<Axes: xlabel='Importance', ylabel='None'>
[Figure: feature importance bar plot]

Model Interpretability using SHAP values¶

In [ ]:
import shap
shap.initjs() # Initialize JavaScript for interactive plots

# Fit the explainer on the model's predict function
explainer = shap.Explainer(dt_3.predict, X_test)
# Compute the SHAP values - this can take some time
shap_values = explainer(X_test)
In [ ]:
# Feature impact
shap.summary_plot(shap_values)
In [ ]:
# Feature impact - on a particular loan
shap.plots.bar(shap_values[0])

2) Problem and Solution Summary¶

Provide the summary of the problem

  • The bank was looking to simplify its applicant approval process and wanted to build a classification model that met the following requirements:
    • based on its existing loan data
    • developed using sound statistical analysis
    • interpretable, providing justification for any adverse action (rejection)

State the reason for the proposed solution design

  • Different classification techniques were evaluated and model performance was compared.

  • The final solution was proposed after evaluating the soundness and stability of the models.

A decision tree-based model was selected because it is easy to interpret, had strong performance (accuracy of ~88% and recall of ~81%), and performed comparably on the test data.

How would it affect the problem/business?

  • On implementing the model, the bank can introduce automated decision-making into its application approval process, standardize it, and reduce errors due to human judgment.
  • With the model in the approval process, applicants would need to enter their data to get a score or decision. This will improve the data the bank collects on borrowers and also help reduce defaults.
  • The bank can also focus its time on verifying applications that are eligible for approval per the model, rather than on all applications.

Overall, use of the model will enable the bank to improve its decision-making process and, if used appropriately, reduce losses.

3) Recommendations for implementation¶

What are some key recommendations to implement the solution?

  • The stakeholders need to define a threshold (cut-off) on the model score for approving or declining applicants.

  • The underwriting team will need training on how to use the model in the approval process and how to respond to questions about model-based declines.

  • The model could be introduced as a challenger to the current process, to evaluate its effectiveness in use and to check whether any adjustments are needed before fully adopting it for decision-making.
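To illustrate the cut-off recommendation, a minimal sketch of applying a stakeholder-defined probability threshold to model scores. The dataset, model, and the 0.30 cut-off below are hypothetical placeholders for illustration, not the notebook's tuned values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced dataset standing in for the loan data
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]  # predicted default probability

cutoff = 0.30  # hypothetical business cut-off, to be set by stakeholders
decision = np.where(proba >= cutoff, "DECLINE", "APPROVE")
print(dict(zip(*np.unique(decision, return_counts=True))))
```

Lowering the cut-off declines more applicants (fewer missed defaulters, more lost business); raising it does the opposite, which is why the threshold is a business decision rather than a modeling one.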

What are the key actionables for stakeholders?

  • The stakeholders need to monitor the effectiveness of the cut-off and may need to re-evaluate it on a regular basis.

  • The stakeholders need to introduce model monitoring to analyze population stability.

    • The model is developed on existing loans; if the bank's customer base changes, or improved data collection reveals a shift in the population relative to the development data, the model will need to be rebuilt.
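A common way to quantify such a population shift is the Population Stability Index (PSI). A minimal sketch, with synthetic score distributions standing in for the development-time and incoming populations (a common rule of thumb: PSI below 0.1 means stable, 0.1–0.25 a moderate shift, above 0.25 a significant shift warranting a rebuild):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a new sample."""
    # Decile edges from the baseline; open-ended outer bins catch outliers
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) for empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)  # development-time score distribution
shifted = rng.normal(0.5, 1.0, 5000)   # drifted incoming population
print("PSI (same population):", psi(baseline, baseline[:2500]))
print("PSI (shifted population):", psi(baseline, shifted))
```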

What is the expected benefit and/or costs (List the benefits of the solution. State some rational assumptions to put forward some numbers on costs/benefits for stakeholders)?

What are the potential risks or challenges of the proposed solution design?

  • The application data required by the model needs to be gathered, and any derived calculations done correctly, before use as model inputs. If there are errors or missing data, model decisions will be inaccurate.

  • Models that are implemented incorrectly, or modified without stakeholder awareness (whether before testing/validation or by members of the implementation team), can lead to unintended consequences in the application process.

What further analysis needs to be done or what other associated problems need to be solved?

  • As performance data on applications approved using the new model becomes available, the effectiveness of the current model needs to be monitored and the model rebuilt if needed.

  • The model's effectiveness on rejected applications also needs to be evaluated. The stakeholders should look at validating model performance on all applications rather than only on existing customers.

In [ ]:
! jupyter nbconvert --to html  '/content/drive/My Drive/GL_MIT_ADSP/Week9_Capstone/ChetanaShah_Capstone_Project_Notebook_Loan_Default_Prediction_Full_Code.ipynb'